Supplementary material for “Lighter: fast and memory-efficient error correction without counting”
Abstract
For a standard Bloom filter, each of the h hash functions can map an item o to any element of the bit array. The bit array is often much larger than the processor cache, so each probe into it is likely to cause a cache miss. Putze et al. [5] propose a blocked Bloom filter. Given a block size b, the first hash function H0(o) selects a block of b consecutive positions in the bit array. Then H1(o), ..., Hh−1(o) map o onto elements of that block. When b is less than or equal to the size of a cache line, the h accesses tend to cause only one or two cache misses, rather than approximately h cache misses. The drawback is that h and m must be somewhat larger to achieve the same false positive rate (FPR) as a corresponding standard Bloom filter. To estimate the FPR of the blocked Bloom filter, we can consider each of the m − b + 1 possible blocks. For the i-th block, the FPR within the block is $(b'_i/b)^{h-1}$, where $b'_i$ is the number of bits set to 1 in block i and the exponent counts the h − 1 hash functions that probe within the block. So the overall FPR is:

$$\mathrm{FPR} = \frac{1}{m-b+1} \sum_{i=1}^{m-b+1} \left( \frac{b'_i}{b} \right)^{h-1}$$
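The blocked scheme and its FPR estimate can be illustrated with a minimal Python sketch. This is not Lighter's implementation: the class name `BlockedBloomFilter`, the helper `_hash`, and the use of salted BLAKE2b as the hash family are assumptions made for illustration, and the array stores one byte per bit for readability. The estimate averages the per-block rate over all m − b + 1 block start positions, assuming H0 picks a start position uniformly at random.

```python
import hashlib


def _hash(item: str, index: int) -> int:
    """Hypothetical hash family: BLAKE2b salted with the hash-function index."""
    salt = index.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


class BlockedBloomFilter:
    """Illustrative blocked Bloom filter with m bits, block size b, h hashes."""

    def __init__(self, m: int, b: int, h: int) -> None:
        if not (1 <= b <= m) or h < 2:
            raise ValueError("need 1 <= b <= m and h >= 2")
        self.m, self.b, self.h = m, b, h
        self.bits = bytearray(m)  # one byte per bit, for clarity only

    def _positions(self, item: str) -> list[int]:
        # H0 selects the block start; H1..H(h-1) select offsets inside it.
        start = _hash(item, 0) % (self.m - self.b + 1)
        return [start + _hash(item, j) % self.b for j in range(1, self.h)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

    def estimated_fpr(self) -> float:
        """Average of (b'_i / b)^(h-1) over all m - b + 1 blocks."""
        n_blocks = self.m - self.b + 1
        ones = sum(self.bits[: self.b])  # set bits in the first block
        total = (ones / self.b) ** (self.h - 1)
        for i in range(1, n_blocks):
            # Slide the window one position: add the new bit, drop the old one.
            ones += self.bits[i + self.b - 1] - self.bits[i - 1]
            total += (ones / self.b) ** (self.h - 1)
        return total / n_blocks
```

For example, after creating `BlockedBloomFilter(m=4096, b=512, h=4)` and adding a few hundred strings, `estimated_fpr()` returns the averaged per-block rate. A production filter would pack bits into machine words and align blocks to cache lines, which this sketch does not attempt.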